Presentation: Tweet"Following Google or Don’t follow the followers, follow the leader"
It makes good sense to follow Google's lead with technology. Not because what Google does is particularly complex; it isn't. We follow Google for two reasons:
1. Google is operating at an unprecedented scale, and every mistake they make is one we don't have to repeat, while every good decision they make (defined as "decisions that stick") is one we should probably emulate;
2. Google is as strong an attractor of brainpower and development talent as IBM's labs once were; that much intellectual horsepower – even if a large part of it is frittered away on the likes of Wave, Buzz, Lively and Aardvark – produces value, ultimately for all of us.
Implementing Hadoop is not following Google's lead. It's following Yahoo's lead or, more precisely, following the lead of the venture capital community, which took a weak idea and made an industry of it by throwing money at any company with a repeat-offender CEO and the words "big data" in their PowerPoint. MapReduce is well behind the state of the art, to the point that Google discarded it as a cornerstone technology years ago.
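For concreteness, here is what the MapReduce model being dismissed actually amounts to – a minimal, single-process sketch in plain Python (no Hadoop involved; the function and variable names are mine, purely illustrative):

```python
from collections import defaultdict
from itertools import chain

# A minimal, single-process sketch of the MapReduce model itself.
# Real Hadoop/MapReduce distributes these phases across a cluster
# and persists intermediate results to disk.

def map_phase(document):
    """Map: emit (word, 1) pairs for each word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the grouped values to a single count."""
    return key, sum(values)

documents = ["the data flows", "the data persists"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
print(counts)  # {'the': 2, 'data': 2, 'flows': 1, 'persists': 1}
```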
Scale, speed, persistence and context are the most important design problems we'll have to deal with during the next decade. Scale, because we're creating and recording more data than at any time in human history – much of it of dubious value, but none of it obviously value-less. Speed, because data flows now. Ceaselessly. In high volume. Persistence, because data has to be kept at multiple latencies, from milliseconds to decades. And context, because the context of creation is different from the context of transmission is different from the context of use.
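A sketch of what "multiple latencies" of persistence can mean in practice – the tier names, latency figures and retention periods below are illustrative assumptions, not a reference architecture:

```python
import time

# The same record fans out to storage tiers with very different
# latency and retention profiles; figures here are assumptions.

TIERS = [
    {"name": "in-memory cache", "write_latency_s": 1e-6, "retention_s": 60},
    {"name": "hot store (SSD)", "write_latency_s": 1e-3, "retention_s": 86_400 * 30},
    {"name": "cold archive",    "write_latency_s": 1.0,  "retention_s": 86_400 * 365 * 20},
]

def persist(record: dict) -> None:
    """Fan a record out to every tier; in practice the slower tiers
    would be written asynchronously, off the request path."""
    for tier in TIERS:
        # Stand-in for the actual write call to each backend.
        print(f"wrote {record['id']} to {tier['name']} "
              f"(latency ~{tier['write_latency_s']}s, "
              f"kept ~{tier['retention_s']}s)")

persist({"id": "evt-42", "payload": "reading", "ts": time.time()})
```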
Software development conceives of data as lifeless, discrete, time-bound chunks – when in fact data is mutable, continuous, non-discrete. Data has a life and a lifecycle. As developers, we focus too much on the code that creates or processes data, and too little on the lifecycle of the data itself.
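One way to make "data has a lifecycle" tangible is to let records carry their own history. A sketch, with field names that are illustrative assumptions rather than any established schema:

```python
from dataclasses import dataclass, field
import time

# A record with a lifecycle: instead of storing an inert value, every
# record accumulates the contexts it passes through (creation,
# transmission, use). Field names are illustrative only.

@dataclass
class Record:
    payload: dict
    history: list = field(default_factory=list)

    def stamp(self, stage: str, context: str) -> None:
        """Append a lifecycle event rather than overwriting state."""
        self.history.append({"stage": stage, "context": context,
                             "ts": time.time()})

r = Record(payload={"reading": 21.5})
r.stamp("created", "sensor-7, raw, uncalibrated")
r.stamp("transmitted", "batched upload, 5s delay")
r.stamp("used", "daily aggregate for billing")
# The same payload means something different at each stage; the
# history preserves that, where a bare value would not.
print(r.history)
```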
There are a lot of red herrings, false premises and just plain dementia that get in the way of seeing the problem clearly. We must work through what we mean by "structured" and "unstructured", what we mean by "big data", and why we need new technologies to solve some of our data problems. But "new technologies" doesn't mean reinventing old technologies while ignoring the lessons of the past. There are reasons relational databases survived while hierarchical, document and object databases were market failures – technologies that may be poised to fail again, 20 years later.
What we believe about data's structure, schema and semantics is as important as the NoSQL and relational databases we use. The technologies impose constraints on the real problem: how we make sense of data in order to tell a computer what to do, or to inform human decisions. Most discussions of data and code lose sight of the unconscious tradeoffs made when selecting these technology handcuffs.
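To make those handcuffs concrete, here is a sketch of the two poles of the schema tradeoff – schema-on-write (relational style) and schema-on-read (NoSQL style). No real database is involved; the names and shapes are illustrative:

```python
import json

# Schema-on-write validates structure up front; schema-on-read stores
# anything and imposes structure only at the moment of use. Both
# snippets are illustrative, with no real database behind them.

REQUIRED = {"id", "amount"}

def write_validated(row: dict, table: list) -> None:
    """Schema-on-write: reject data that doesn't fit the declared shape."""
    missing = REQUIRED - row.keys()
    if missing:
        raise ValueError(f"row rejected, missing {missing}")
    table.append(row)

def read_interpreted(blob: str) -> float:
    """Schema-on-read: structure is guessed (and can fail) at use."""
    doc = json.loads(blob)
    return float(doc.get("amount", 0))  # the guess lives in the reader

table: list = []
write_validated({"id": 1, "amount": 9.99}, table)  # fits; stored
print(read_interpreted('{"id": 2}'))               # stored fine, reads as 0.0
```

The first approach surfaces bad data at write time; the second defers every such decision to every reader, which is exactly the kind of unconscious tradeoff the paragraph above describes.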
What can following Google, as a design principle, tell us about scale, speed, persistence and context? Perhaps that relational models of some sort will be in your future. That workloads are broader than a single application. That synthetic activities downstream from the point where data is recorded are as important as that initial point. And that there are state-of-the-art developments pointing toward possible futures for persistence and processing layers.